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Abstract 

Adaptive Monte Carlo schemes developed over the last years usually seek to en¬ 
sure ergodicity of the sampling process in line with MCMC tradition. This poses 
constraints on what is possible in terms of adaptation. In the general case ergodic¬ 
ity can only be guaranteed if adaptation is diminished at a certain rate. Importance 
Sampling approaches offer a way to circumvent this limitation and design sam¬ 
pling algorithms that keep adapting. Here I present a gradient informed variant of 
SMC (and its special case Population Monte Carlo) for static problems. 


1 Introduction 

Monte Carlo methods have been developed into one of the mainstream inference methods of 
Bayesian Statistics and Machine Learning over the last thirty years m. They can be used to approx¬ 
imate expextations with respect to posterior distributions of a Bayesian Model given data. The most 
widely used Monte Carlo Method for this purpose today is Markov Chain Monte Carlo (MCMC). 
In this approach, a Markov Chain is constructed which is ergodic with respect to the given posterior 
distribution. In parallel, a scheme for sampling from a sequence of target distributions, called Se¬ 
quential Monte Carlo (SMC), has been developed ||2|. SMC has traditionally been used for inference 
in time series models and for online tracking applications. However, there has been a considerable 
amount of research on using SMC for inference in static models as well Eiiiiia. In this paper, I will 
develop a powerful variant of SMC for static models making use of gradient information, dubbed 
Gradient Importance Sampling. 

The paper will proceed as follows. In subsection |l.l| I will give a short overview of the a simple well- 
known MCMC algorithm making use of gradient information, the Metropolis Adjusted Langevin 
Truncated Algorithm. In subsection |1.2| an exposition to Importance Sampling and SMC is given. 
Gradient IS and its variants are introduced in [^Related work, especially in adaptive MCMC and 
Importance Sampling, is discussed in Sectionl^ Gradient IS is evaluated and compared against 
previous algorithms in Section]^ The last section concludes. 

1.1 Langevin Monte Carlo 

Metropolis Adjusted Langevin Truncated Algorithm (MALTA) m and Hamiltonian Monte Carlo 
(HMC) I?) are two well known sampling algorithms that make use of gradient information. In 
the following, we will denote the target density as /. Often times, this will we the posterior of a 
Bayesian model which can be evaluated only proportionally by multiplying the prior and likelihood 
at a given a point. 

HMC is probably better known in the Machine Learning community, but it is notoriously complex 
and its description is beyond the scope of this pape. For a thorough introduction see e.g. imil. The 
special case of HMC however, MALTA, is closely related to the algorithm proposed in this paper 
and a concise introduction will be given. MALTA is a variant of the Metropolis-Hastings MCMC 
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algorithm where, given the current state of the Markov Chain X', a proposal for a new state X is 
sampled from the multivariate normal density 

( 7 (-|X') = N{-\X' + Diy log f{X')),G) 


where is a drift function. For consistency reasons, the MALTA variant used in the evaluation 
section will use D(V log f{X')) = log f{X') for 0 < 5 < 1. The covariance matrix C is hxed 
by the user prior to running the algorithm. The proposed new state X is then accepted with the usual 
Metropolis-Hasting acceptance probability 


min 


( f{X')q{X\X')\ 

\ ’ f{X)q{X'\X)j 


and recorded as a sample. If the proposed state is rejected, the chain remains at state X' (which is 
recorded as a sample again). The Markov Chain is ergodic with respect to /, i.e. the samples pro¬ 
duced are approximately from /, which is guaranteed by using the Metropolis-Hastings correction. 
The samples can be used to estimate the expectation H of some function of interest h with respect 
to the target density / using the law of large numbers as 


H = 


N 

h{x)f{x)dx « 1/iV^ h{Xi) 

i=l 


where Xi ranges over the samples and N is the number of samples. 


1.2 Importance Sampling and SMC 

Importance Sampling takes a different approach. Instead of trying to sample approximately from 
/, it samples from some proposal density q instead. Rather than correcting for the change of distri¬ 
bution using Metropolis-Hastings, the Importance Sampling estimator simply weighs each sample 
X by the so called importance weight w(X) = f(X)/q(X). In the case where / is not normal¬ 
ized, which is the usual case when estimating a Bayesian posterior, the self-normalized Importance 
Sampling estimator for iJ given by 




N 






Sequential Monte Carlo (SMC) ||2l builds on Importance Sampling and was originally devised to 
sample from a sequence of target distributions. For ease of exposition, I will first consider the case 
where the same target distribution is used at each iteration, a special case known as Population Monte 
Carlo (PMC). From this, an extension to sequence of targets is straight forward and given in Section 
2.1 for the case of static models (i.e. not time series). In Population Monte Carlo, we first gather 
a set of p samples (also called particles in SMC) Xi,..., Xp from proposal densities qi,... ,qp 
which are assigned weights w{Xi) = f[Xi)/qi{Xi). Instead of using these weighted samples 
directly with the Importance Sampling estimator to evaluate the integral of interest, we resample 
Ai,..., Xp with replacement according to their respective weights, adding the resulting set to a set 
of unweighted samples S. This is called Importance Resampling and produces a sample set that is 
approximately coming from the posterior i). Several methods exist for this step, the easiest being 
multinomial resampling. See m for a review including some theoretical results. Previous samples 
can now be used to construct proposal distributions for the next iteration. In the simplest case this 
could be centering a proposal distribution on a previous sample. The procedure is iterated until S is 
deemed large enough. The integral of interest can now simply be computed by 

N 


H= ( h{x)f{x)dxf^l/\S\ E w 

xes 


Moreover, the marginal likelihood Z of the data (also called evidence of the model or normalizing 
constant of /) can be approximated by the formula 




Z « 1/iVuj E^* 


i=l 
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Algorithm 1 Gradient IS algorithm 

Input: unnormalized density /, gradient V log /, population size p, p intial samples S, sample 
size m 

Output: list S of m samples 
while len{S) < m + p do 
Initialize P = List{) 

Initialize W = ListQ 

for * = 1 to p do 

(a) sample X' uniformly from last p samples in S 

(b) generate X ^ qt{-\X'), append it to P 
append weight f {X)/qt{X) to W 

end for 

resample p values from P with repl. according 
to weights and append samples to S 
compute Ct+i 

end while 

remove first p samples from S 


where Wi are the weights that have been gathered from the stage before resampling and is the 
total number of weights. 

A major argument for Gradient IS is the ability to approximate the marginal likelihood and the 
target distribution as good as or better than previous gradient-informed and/or adaptive sampling 
algorithms while being extremely simple to implement. For example, this opens the possibility to 
routinely compute Bayes factors (and thus do Bayesian Model selection) as a by-product of very 
efficient posterior sampling instead of using special inference techniques geared towards only com¬ 
puting Z. 

2 Gradient IS 

Gradient IS (GRIS) is a variant of Sequential Monte Carlo ||2|, or, when targeting the same density 
at each iteration, of its special case Population Monte Carlo GRIS accomplishes adaptivity by 
fitting a covariance matrix Ct to samples from the target density that have been gathered at time t. 
The proposal distribution for a new sample given an old sample X' is then given by 

qt{-\X') = 7V(.|A' + D{t,X log fix')), Ct) 

where D is a drift function. We used 

Dit, V log fix')) = (<5/i'-')V log fix') 

thus introducing a parameter ^ > 0 (and usually <5 < 1) which ought to be tuned for each target 
density. To fit Ct we use the parametrization 

Ct — < /q 

* \sdicoviXo,...,Xt-i) + eI) t>to 

For an initial Co and some to and where > 0 and e > 0 are tunable parameters. To update Ct 
with a new sample Xt, we can a recursion formula 

Ct+I = ^-^Ct + y {tXt-iXj_^ -it + l)XtXj + XtXf + el) 

Where Xt is the running average at iteration t (with an obvious recursion formula) ini. Directly up¬ 
dating either the Cholesky or Eigendecomposition of Ct results in significant computational savings, 
especially when the updates are done frequently. Both decompositions can be used to draw samples 
from and evaluate the density of the respective multivariate normal, both of which are prerequisites 
of the GRIS algorithm. A complementary method to tradeoff speed and adaptivity is to collect more 
than one new sample before updating. GRIS then proceeds as given in Algorithm[T] 
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2.1 Gradient IS with a sequence of target distributions 

Instead of using the correct target distribution / at every IS iteration like in PMC, it is also possible to 
construct a sequence of intermediate target distributions where gr = /■ In the existing SMC 

literature addressing static targets, only the samples acquired when targeting the actual distribution 
of interest are then used for estimating the integral H ilia. This approach is very robust when 
compared to MCMC algorithms, as it explores the probability space better laia. 

One possibility for constructing a sequence of distributions is the geometric bridge defined by gt cx 
for some initial distribution po and where {pt)J=i is an increasing sequence satisfying 
Pt = 1. Another is to use a mixture pt cx (1 — pt)go + Ptf- When / is a Bayesian posterior, 
one can also add more data with increasing t, e.g. by defining the intermediate distributions as 
gt{X) = f{X\di,... where D is the number of data points, resulting in an online inference 

algorithm 0. 

However, when using a distribution sequence that computes the posterior density / using the full 
dataset (such as the geometric bridge or the mixture sequence), one can reuse the intermediate sam¬ 
ples when targeting gt for posterior estimation using a simple trick. As the value of / is computed 
anyway for the harmonic bridge and the mixture sequence, we can just use the weight f{X)/qt{X) 
for posterior estimation while employing gt{X)/qt {X ) to inform future proposal distributions. This 
way the evaluation of / (which is typically costly) is put to good use for improving the posterior 
estimate. To the best of my knowledge, this recycling of samples in SMC for static targets has not 
been reported in the literature. 


3 Related work 

3.1 Adaptive Monte Carlo using ergodic stochastic processes 

Adaptive MCMC algorithms use samples acquired so far in order to tune parameters of the sam¬ 
pling algorithm to adapt it to the target. The first algorithm of this class, the Adaptive Metropolis 
(AM) Algorithm lIT^ fits a Gaussian approximation with mean Xt and covariance matrix Ct to the 
samples acquired up until iteration t, in exactly the same fashion as GRIS. The proposal distribution 
then is the same as that of GRIS, except for the fact that no gradient information is used, i.e. the drift 
function is D{-, •) = 0. A proposal then is accepted or rejected using the Metropolis-Hastings cor¬ 
rection, which is another difference from GRIS, which uses importance weighting instead. The AM 
algorithm is the first adaptive MCMC algorithm and is provably ergodic wrt the target distribution 
./• 

A similar algorithm making use of gradient information is the adaptive Metropolis Adjusted 
Langevin Algorithm with truncated drift (adapt. MALTA) lfT3l . Adaptive T-MALA uses a proposal 
distribution given by N{-\Xt-i + ^D{Xt-i),Ct), where D{-) is a drift function. The covari¬ 
ance Ct is constrained to lie in the convex compact cone defined by the projection P 2 {Ct) = C't if 
\C't\ < Ai wdp2{Ci) = j^C'j else for some parameter Ai and where | • | is the Frobenius norm. 
Similar constraints and accompanying projections are given for adapting the parameter and fitting 
the target sample mean Xt (for details see ini). 

Informally, adapting the Markov Kernels used in MCMC will result in an algorithm that is ergodic 
with respect to target / in the general case as long as 

1. all Markov kernels used are ergodic wrt target / (have / as their stationary distribution) 

2. adaptation diminishes, e.g. adaptation probability pt for iteration t satisfies pt —>■ 0 as 
t —>■ 00. 

(Theorem 1 in iflTl '). We can still have as N ^ oo, i.e. infinite overall adaptation 

(as in El). It is important to note that the diminishing adaptation is sufficient but not necessary. 
Other schemes might still be ergodic, one example being the AM algorithm, where adaptation is not 
diminished. 
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3.2 Adaptive Importance Sampling Schemes 

Adaptive Importance Sampling schemes in related to SMC have mostly been trying to fit a closed- 
form approximation to the posterior for usage as a proposal distribution in the next iteration. In 
particular, previous work used optimality criteria such as minimization of KL-divergence between 
proposal distribution and posterior to asses the quality of the proposal distribution. After collecting 
new samples, D-Kernel approximations im or Mixture distributions (such as Mixtures of Gaus- 
sians or Student-T mixtures) El are updated to fit the posterior better. In another line of research, 
better ways of using the produced samples have been sought. A particularly interesting approach is 
Adaptive Multiple Importance Sampling lfT8l[T9ll . Here, samples from previous proposal distribu¬ 
tions are reweighted after the sampling is finished in order to reflect the overall proposal distribution 
used across iterations. This is achieved by simply assuming that each weighted sample is produced 
by a mixture over the proposal distributions used, where the mixture weight is proportional to the 
number of samples actually drawn from a particular proposal. This reweighting has been shown to 
be consistent and considerably improve consecutive estimates El- 


4 Evaluation 


For an experimental evaluation of the adaptive Gradient IS algorithm, I targeted three synthetic 
distributions (i.e. not posteriors based on some dataset) and one posterior of a Bayesian Logistic 
regression model. The synthetic distributions have the advantage that the expected value H of the 
target is known in closed form, for the Logistic Regression dataset ground truth was estimated in a 
separate Monte Carlo experiment. I started the simulation at H to avoid having to handle burn-in 
for MCMC and ran 20 Monte Carlo simulations with 3000 target or gradient evaluations each. If 
the target and its gradient are evaluated at the same point, this is counted as a single evaluation. 
This is justified by the fact that most of the time either likelihood or gradient calculation dominate 
complexity and computing the other value uses only little additional time (since factors common to 
both likelihood and gradient can be cached). I compared to the AM algorithm, adaptive T-MALA, 
and Hamiltonian Monte Carlo 13 MCMC algorithms, where I tuned the parameters to give optimal 
acceptance rates as reported in the literature. GRIS was tuned for low estimates of Monte Carlo 
variance, an information that is available in typical black-box situations (i.e. where nothing is known 
about the target / except the information produced by the sampling algorithm). Is used the variant 
of GRIS that did not use a sequence of target distributions as it gave good results and was easier to 
implement. 


4.1 Maximum squared errors aud Effective Sample Size 


Estimating Effective Sample Size (ESS) is easy in traditional Importance Sampling and does not 
require establishing ground truth. Given importance weights of N collected samples and 

their normalized version Wi = ruj/ compute ESS as 


ESS'^ = 




If all the weights are the same (i.e. we have a sample coming from our target distribution), we have 
ESS^ = N, whereas if only one of the weights is non-zero, ESS^ = 1. A necessary condition for 
this estimate to be valid however is that drawn importance samples are iid. Eor MCMC a similar 
estimate (under the same name) is available after establishing ground truth for an integral of interest. 
An exposition to this is given in EOl whom we will follow. Given a function h which we want to 
integrate wit the target density /, and samples Xi ,..., from a Markov Chain targeting /, that 
follow the distribution qmc we can use the estimate 


ESS“‘= 


N 




where is the autocorrelation of h under qmc at lag i- If the autocorrelation is exactly 0 at each 
lag (i.e. all of the samples are independent), then again ESSjv*" = N, as desired. An estimate of 
autocorrelation is given by 
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Gradient adapt. (IS) MALTA adapt. (MCMC) 


Figure 1; Gaussian mixture target; Contour plot Figure 2: Gaussian Mixture target; SE perfor¬ 
mance. Algorithms not shown are widely off 
scale. 


In this autocorrellation estimate we make use of the ground truth values E[h(Ai)] and V[/i(2f)] 
which can only be estimated by a long run of some Monte Carlo method when / is a black block 
(for example when it is given only proportionally as a Bayesian posterior). Now as the estimate of 
autocorrelation is noisy for large lags, Il20l suggests to truncate the autocorrelations at the smallest 
cutoff c with pj? < 0.05, yielding the estimator 


^ i + 2ELi(i-^)^ 

-MC 

Now as ESS is a univariate measure, ETl 12011 suggest the following approach; calculate ESS for 
each dimension separately and take the minimum across dimensions. Because usually in a bayesian 
setting one is interested in estimation of variance, the suggested procedure is to do this for an es¬ 
timate of both expected value and variance of / and take the minimum of those as the final ESS 
estimate. 

Ironically, an ESS estimate does not exist for Sequential Monte Carlo when targeting the same 
density across iterations (the situation is different in a dynamic model with multiple targets). Eor 
one, it is not possible to use the estimate ESS*® because SMC induces dependency between samples. 
Also, ESS*^*" is not usable because the dependency structure in SMC, and thus the computation 
of autocorrelation, is much more complicated than in MCMC, as a sample at one iteration of the 
algorithm can spawn several samples in the next iteration. 

However, using ground truth estimates V[/i(2Q] and E[h(Ai)] as ESS*^*' does, it is possible to repro¬ 
duce the worst case properties suggested by ll^l20l . To this end, one measures maximum squared 
error (MaxSE) by computing, for each Monte Carlo run, squared errors with respect to V[/i(X)] and 
E[/i(2f)] and taking the maximum across those and across dimensions. 


4.2 Gaussian Grid 

The first target was a mixture of 25 2D Gaussians with equidistant means and mixture weights 
depending on distance from origin, a contour plot for which is given in Figure [T| A box plot for 
the squared error across different numbers of target evaluations is given in Figure]^ where the box 
ranges from 1st to 3rd quartile and whiskers extend to 1.5 * IQR (interquartile range). Hamiltonian 
Monte Carlo (HMC) was not plotted here, because the algorithm performed much worse than the 
other three candidates and proper scaling was impossible. 
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Figure 3: Gaussian mixture target: Squared bias, MSE and variance as a function of number of 
target evaluations 




Figure 4: Banana target: Contour plot Figure 5: Banana target: SE performance 


For this target, the performance was very close for the adaptive algorithms. Adaptive Gradient IS 
exhibits smallest MSE and variance (see Figure [^, but the AM algorithm is a close second. As 
multimodal targets are traditionally problematic for gradient informed sampling algorithms, it is 
interesting to see that adapting to the posterior covariance structure of the target can help mitigate 
problems in this case. The particularly weak performance of HMC possibly stems from the algo¬ 
rithm getting stuck in different modes for different runs, explaining in particular the high variance 
(Figure]^. 

4.3 Banana 

The 2D Banana shaped unnormalized measure was given by f{x) = exp(—— ^{x 2 — b{x\ — 
s))^. The measure is determined by parameters b and v which where set to 100 and 0.03, respectively 
(see Figure]^ for a contour plot). For unimodal targets, gradient based samplers are traditionally 
strong, though that advantage might be irrelevant when we adapt a scale matrix for a simple random 
walk algorithm. As is evident from Figures|^and|^ the simple AM algorithm actually is competitive 
with adaptive T-MALA for this target. However, adaptive Gradient IS again shows gains here when 
compared to the other samplers. This is particularly encouraging since T-MALA and Gradient IS are 
using the exact same drift function and T-MALA adapts more parameters than Gradient IS does. If 
the remaining parameters of Gradient IS can be adapted automatically, further improvements might 
be possible. 

4.4 Mixture of T distributions 

The third target considered was a mixture of three lOD multivariate f-distributions with different 
means, 10 degrees of freedom and varying mixture weights. The scale matrices where drawn from 
an Inverse Wishart distribution with identity scale matrix and 10 degrees of freedom, yielding mix¬ 
ture components with strong correlations among dimensions. A contour of the marginal density 
of the last two coordinates is given in Figure |7] This case was considered because f-distributions 


7 






































HMC (MCMC) 

^H AM (MCMC) 

• Gradient adapt. (IS) 

*—* MALTA adapt. (MCMC) 





• 31—■-^-J 

5 6 7 8 

log # Ihood evals 


Figure 6: Banana target: Squared bias, MSE and variance as a function of number of target evalua¬ 
tions 
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Figure 8: f-Mixture target: SE of mean estimates, 
Figure 7: 10 D t-Mixture target: Contour plot ofaveraged across dimensions 
the marginal density of the last two dimensions 


have heavier tails than Gaussians. As a standard rule of thumb, IS proposal distributions should 
have heavier tails than the target to ensure finite variance of the estimate m. With a mixture of 
f-distributions, I wanted to experimentally check for any problems stemming from the Gaussian 
proposal used in all algorithms. GRIS is better when averaging squared error across dimensions 
(Figure 1^. Also, when when comparing maximum log squared errors GRIS is clearly an improve¬ 
ment over previous algorithms (Figure]^. 



•O.5I-^^-'-'- 

MALTA adapt. (MCMC) AM (MCMC) Gradient adapt. (IS) HMC (MCMC) 


Figure 9: f-Mixture target: Squared bias, MSE and variance as a function of number of target 
evaluations 
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MALTA adapt. (MCMC) AM (MCMC) Gradient adapt. (IS) HMC(MCMC) 


Figure 10; Logistic regression; Squared errors ofFigure 11; Logistic regression; Maximum log SE 
mean estimate averaged across dimensions. Algo-across estimates of posterior variance and mean 
rithms not shown are widely off scale. and across dimensions. 


T Mixture lOD 



- 1008.0 

- 1008.2 

- 1008.4 

- 1008.6 

- 1008.8 

- 1009.0 

- 1009.2 

- 1009.4 

- 1009.6 

- 1009.8 


German Credit LR 25D 



# Ihood evals 


Figure 12: Evidence estimates 


4.5 German Credit Dataset: Logistic regression 

The German Credit dataset was used with the Logistic Regression model developed in This 
model exhibited a 25D posterior distribution which allowed for a Laplace approximation. To find 
ground truth, I collected ordinary Importance Samples from a mixture between the Laplace Approx¬ 
imation and the sample approximation with slightly increased scale, a method known as defensive 
Importance Sampling ll22l . The Effective Sample Size of this approximation was over 66, 000. 

4.6 Evidence Estimation 

Evidence estimates using GRIS quickly stabilized. I assesed MSE of evidence estimates for the 
f-Mixture target and the Logistic Regression model (Figure [T2]). The log evidences where —1000 
and —504, respectively. 

5 Conclusions 

In this paper, I presented Gradient IS, a variant of Sequential Monte Carlo. The algorithm uses 
gradient information and a covariance matrix adapted to collected posterior samples to construct a 
multivariate normal proposal distribution for SMC with static target distributions. GRIS was shown 
to give very good performance in posterior sampling and provide stable estimates for the normalizing 
constant of the target distribution (also known as model evidence). 
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